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ABSTRACT 

Biologists are increasingly using databases for storing and 
managing their data. Biological databases typically consist 
of a mixture of raw data, metadata, sequences, annotations, 
and related data obtained from various sources. Current 
database technology lacks several functionalities that are 
needed by biological databases. In this paper, we intro- 
duce bdbms, an extensible prototype database management 
system for supporting biological data, bdbms extends the 
functionalities of current DBMSs with: (1) Annotation and 
provenance management including storage, indexing, ma- 
nipulation, and querying of annotation and provenance as 
first class objects in bdbms, (2) Local dependency track- 
ing to track the dependencies and derivations among data 
items, (3) Update authorization to support data curation via 
content-based authorization, in contrast to identity-based 
authorization, and (4) New access methods and their sup- 
porting operators that support pattern matching on vari- 
ous types of compressed biological data types. This paper 
presents the design of bdbms along with the techniques pro- 
posed to support these functionalities including an exten- 
sion to SQL. We also outline some open issues in building 
bdbms. 

1. INTRODUCTION 

Biological databases are essential to biological experimen- 
tation and analysis. They are used at different stages of life 
science research to deposit raw data, store interpretations of 
experiments and results of analysis processes, and search for 
matching structures and sequences. As such, they represent 
the backbone of life sciences discoveries. However, current 
database technology has not kept pace with the prolifer- 
ation and specific requirements of biological databases |25l 
137] . In fact, the limited ability of database engines to furnish 
the needed functionalities to manage and process biological 
data properly has become a serious impediment to scientific 
progress. 

In many cases, biologists tend to store their data in flat 
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files or spreadsheets mainly because current database sys- 
tems lack several functionalities that are needed by biolog- 
ical databases, e.g., efficient support for sequences, anno- 
tations, and provenance. Once the data resides outside a 
database system, it loses effective and efficient manageabil- 
ity. Consequently, many of the advantages and functionali- 
ties that database systems offer are nullified and bypassed. 
It is thus important to break this inefficient and ineffective 
cycle by empowering database engines to operate directly 
on the data from within its natural habitat; the database 
system. 

Biological databases evolve in an environment with 
rapidly changing experimental technologies and semantics 
of the information content and also in a social context that 
lacks absolute authority to verify correctness of information. 
Furthermore, because the only authority is the scientific 
community itself, biological databases often require some 
form of community-based curation. These characteristics 
make it difficult, even using good design strategies, to com- 
pletely foresee the kinds of additional information (termed 
annotations) that, over time, may become necessary to at- 
tach to data in the database. 

In this paper, we propose bdbms, an extensible proto- 
type database engine for supporting and processing biolog- 
ical databases. While there are several functionalities of 
interest, we focus on the following key features: (1) Anno- 
tation and provenance management, (2) Local dependency 
tracking, (3) Update authorization, and (4) Non-traditional 
and novel access methods, bdbms will make fundamental 
advances in the use of biological databases through new 
native and transparent support mechanisms at the DBMS 
level. 

Annotations and provenance data are treated as first-class 
objects inside bdbms. bdbms provides a framework that al- 
lows adding annotations/provenance at multiple granulari- 
ties, i.e., table, tuple, column, and cell levels, archiving and 
restoring annotations, and querying the data based on the 
annotation/provenance values. In bdbms, we introduce an 
extension to SQL, termed Annotation-SQL, or A-SQL for 
short, to support the processing and querying of annotation 
and provenance information. A-SQL allows annotations and 
provenance data to be seamlessly propagated with query an- 
swers with minimal user programming. 

In bdbms, we propose a systematic approach for tracking 
dependencies among database items. As a result, when a 
database item is modified, bdbms can track and mark any 
other item that is affected by this modification and that 
needs to be re- verified. This feature is particularly desirable 
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Figure 1: Local dependencies 



in biological databases because many dependencies cannot 
be computed using coded functions. For example (refer to 
Figure [1} , protein sequences are derived from gene (DNA) 
sequences. If a gene sequence is modified, the correspond- 
ing protein sequence(s), derived calculated quantities (such 
as molecular weight), and annotations may become invalid. 
Similarly, we may store descriptions of chemical reactions, 
e.g., substrates, reaction parameters, and products. If any 
of the substrates in the reaction are modified, the products 
of the reaction are likely to require re-evaluation. However, 
since these dependencies are complex and involve lab ex- 
periments and external analysis, database systems cannot 
systematically re-compute the other affected items. Lack 
of system support to automatically track such dependencies 
raises significant concerns on the quality and the consistency 
of the data maintained in biological databases. 

Authorizing database operations is also one of the features 
that we extended in bdbms. Current database systems sup- 
port the GRANT/REVOKE access models that depend only 
on the identity of the user. In bdbms, we propose the con- 
cept of content-based authorization, i.e., the authorization 
is based not only on the identity of an updater but also on 
the content of the updated data. For example, lab members 
may have the authority to update a given data set. However, 
for credibility and reliability of the data, these updates have 
to be revised by the lab administrator. The lab administra- 
tor can then approve or disapprove the operations based on 
their contents. In the mean time, users may be allowed to 
view the data pending its approval/disapproval. 

The other key feature in bdbms is to provide access meth- 
ods for supporting various types of biological data. Our goal 
is to design and integrate non-traditional and novel access 
methods inside bdbms. For example, sequences and multi- 
dimensional data are very common in biological databases 
and hence there is a need to integrate new types of index 
structures such as SP-GiST [3] U [16] [22] and the SBC- 
tree [IT] inside bdbms along with their supporting operators. 
SP-GiST is an extensible indexing framework for support- 
ing multi-dimensional data while the SBC-tree is an index 
structure for indexing and querying compressed sequences 
without decompressing them. 

The rest of the paper is organized as follows. In Section[5] 
we present the overall architecture of bdbms. In Sections [3]- 
[7] we present each of the bdbms key features. Section [8] 
overviews the related work, and Section [5] contains conclud- 
ing remarks. 

2. BDBMS SYSTEM ARCHITECTURE 

The main components of bdbms are the annotation man- 
ager, the dependency manager, and the authorization man- 
ager. A-SQL is bdbms's extended SQL that supports anno- 
tation (Section [3]) and authorization commands (Section [6]). 



bdbms's annotation manager is responsible for handling the 
annotations in an annotation storage space (Section[3]). The 
dependency manager is responsible for handling the depen- 
dencies and derivations among database items. These de- 
pendencies are stored in a dependency storage space (Sec- 
tion [5]). The authorization manager handles content-based 
authorizations as well as the standard GRANT/REVOKE 
authorizations over the database (Section [6]). Index struc- 
tures are available in bdbms in support of the multidimen- 
sional and compressed data (Section [7]). 

3. ANNOTATION MANAGEMENT 

Annotations are extra information linked to data items 
inside a database. They usually represent users' com- 
ments, experiences, related information that is not mod- 
eled by the database schema, or the provenance (lineage) 
of the data. Adding and retrieving annotations represent 
an important way of communication and interaction among 
database users. In biological databases, annotations are 
used extensively to allow users to have a better understand- 
ing of the data, e.g., how a piece of data is obtained, why 
some values are being added or modified, and which exper- 
iments or analyses are being performed to obtain a set of 
values. Annotations can be also used to track the prove- 
nance of the data, e.g., from which source a piece of data 
is obtained or which program is used to generate the data. 
Tracking the provenance of the data is very important in 
assessing the value and credibility of the data and in giving 
credit to the original data generators. 

Users can annotate the data at multiple granularities, e.g., 
annotating an entire table, an entire column, a subset of the 
tuples, a few cells, or a combination of these. 

Despite their importance, annotations are not systemat- 
ically supported by most database systems. While anno- 
tation management has been addressed in previous works, 
e.g., [7] [5] [TO] [35], most of the proposed techniques usu- 
ally assume simple annotation schemes and focus mainly on 
annotation propagation, i.e., propagating annotations along 
with the query answers. Other aspects of annotations man- 
agement, e.g., mechanisms for their insertion, archival, and 
indexing as well as more efficient annotation schemes such 
as multi-granularity schemes, have not been addressed. 

In bdbms, we address several challenges and requirements 
of annotation management. We highlight these challenges 
and requirements through the following example. We con- 
sider two gene tables, DBl_Gene and DB2_Gene that have been 
obtained from two different databases (Refer to Figures [2] 
and [3] for illustration). Each table has a set of annotations 
attached to it. We assume a straightforward storage scheme 
for storing the annotations, e.g., the one used in [T], where 
each column in the database has an associated annotation 
column to store the annotations (Figure [3]). 

Adding annotations: Users want to annotate their data 
at various granularities in a transparent way. That is, how 
or where the annotations are stored should be transparent 
from end-users. However, current database systems do not 
provide a mechanism to facilitate annotating the data. For 
example, to add annotation A2 over table DBl_Gene (Fig- 
ure [2}, the user has to know that the annotations are stored 
in the same user table in columns Ann_GID, Ann-GName, 
and Ann_GSequence. Then, the user issues an UPDATE 
statement to update these columns by adding A 2 to the de- 
sired annotation cells (Figure [3]). 
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Figure 2: Annotating tables DBl_Gene and DB2_Gene 
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Figure 3: Simple annotation storage scheme: Every data column has a corresponding annotation column 



To support data annotation, the system has to provide 
new mechanisms for seamlessly adding annotations at var- 
ious granularities. It is essential to provide new expressive 
commands as well as visualization tools that allow users to 
add their annotations graphically. 

Storing annotations at multiple granularities: As 
in Figure users may annotate a single cell, e.g., A3, few 
cells, e.g., Al and B2, entire rows, e.g., A2 and B4, or entire 
columns, e.g., B3. Multi-granularity annotations motivate 
the need for efficient storage and indexing schemes. Other- 
wise, storing and processing the annotations can be very ex- 
pensive. For example, annotations A 2 and B3 are repeated 
in the annotation columns 6 and 5 times, respectively. The 
need for such efficient schemes is especially important in the 
context of provenance where a single provenance record can 
be attached to many tuples or even entire columns or tables. 

Categorizing annotations: Although all annotations 
are metadata, they may have different importance, mean- 
ing, and creditability. For example, annotations that are 
added by a certain user or group of users can be more im- 
portant than annotations added by the public or unknown 
users. Moreover, annotations that represent the lineage of 
the data have different purpose and importance from the 
annotations that represent users' comments. For example, 
annotations A 2 and B3 represent the lineage of some data, 
and users may be interested only in these annotations. As 
will be discussed later in the paper, the different types of 
annotations will also have an impact on the storage mech- 
anism adopted for each type. This diversity in annotations 
motivates the need for separating or categorizing the anno- 
tations, bdbms provides a mechanism that allows users to 
categorize their annotations at the storage, query process- 
ing, and annotation propagation levels. 

Archiving annotations: Users may need to archive or 
delete annotations as they become obsolete, old, or simply 
invalid. Archived annotations should not be propagated to 
users along with query answers. For example, annotation 
B5 in Figure [2] states that gene JW0080 has an unknown 



function. But if the function of this gene becomes known 
and gets added to the database, then B5 becomes invalid 
and users do not want to propagate this annotation along 
with query answers. Without providing a mechanism for 
archiving annotations, the archival operation may not be an 
easy task. For example, to archive annotation B5, the user 
needs to find out which tuples/cells in the database has B5, 
then the contents of each of these cells are parsed to archive 
then delete B5. 

Propagating annotations: A key requirement in 
allowing annotation propagation is to simplify users' 
queries. This can be only achieved by providing database 
system support for annotation propagation; for example, 
by extending the query operators. Otherwise, users' queries 
may become complex and user-unfriendly. For example, 
consider a simple query that retrieves the genes that 
are common in DBl_Gene and DB2_Gene along with their 
annotations (Figure [3]). To answer this query, the user has 
to write the following SELECT statements (a-c): 

(a) Ri(GID, GName, GSequence) = 

SELECT GID, GName, GSequence 
FROM DBl.Gene 
INTERSECT 

SELECT GID, GName, GSequence 
FROM DB2.Gene; 

In Step (a), the user selects only the data columns 
from both gene tables, i.e., GID, GName, GSequence, and 
performs the intersection operation. 

(6) R2(GID, GName, GSequence, Ann.GID, 
AnnJGN ame, Ann JG Sequence) = 
SELECT R.GID, R.GName, R.GSequence, 

G. Ann.GID, G.Ann_GName, G.Ann_G Sequence 
FROM R_l R, DBl.Gene G 
WHERE R.GID = G.GID; 



CREATE ANNOTATION TABLE <ann_table_name> 
ON <user table name> 



DROP ANNOTATION TABLE <ann_table_name> 
ON <user table name> 



Figure 4: The A-SQL commands CREATE and 
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Figure 5: Compact storage for annotations 

In Step (b), the user joins the output from Step (a) back 
with Table DBl_Gene in order to retrieve the annotations 
from this table. Notice that we cannot select the annotation 
columns in Step (a) because, since the annotation values 
in the annotation attributes may vary in the two tables, in 
this case the intersection operation would not return any 
tuples. 

(c) R 3 (GID, GName, GSequence, AnnJGID, 
AnnjGName, Ann _G Sequence) = 
SELECT R.GID, R. GName, R. GSequence, 
R.Ann.GID+G.Ann.GID, 
R.Ann_GName+G.Ann_GName, 
R.Ann_GSequence+G.Ann_GSequence 
FROM R_2 R, DB2.Gene G 
WHERE R.GID = G.GID; 

In Step (c), a join is performed between ife and DB2_Gene 
to consolidate the annotations from DB2_Gene with i?2's an- 
notations, where + is the annotation union operator. 

The main reason for the complexity of querying and prop- 
agating the annotations is that users view annotations as 
metadata, whereas the DBMSs view annotations as normal 
data. For example, from a user's view point, the two tu- 
ples corresponding to genes JW0080 and JW0055 in Table 
DBl_Gene are identical to those in Table DB2-Gene (Fig- 
ure |3J. They only have different annotations. Whereas, 
from the database view point, these tuples are not identical 
because annotations are viewed as normal attribute data. 
As a result, users' queries may become complex in order to 
overcome the mismatch in interpreting the annotations. 

In the following subsections, we introduce our initial in- 
vestigations through bdbms to address the challenges and 
requirements highlighted above along with some preliminary 
results. 



bdbms allows a user relation to have multiple annotation 
tables attached to it. For example, table DBl_Gene may have 
an annotation table that stores the provenance information 
and another annotation table that stores users' comments. 
To create an annotation table over a given user relation, the 
A-SQL command CREATE ANNOTATION TABLE (Fig- 
ure HJ is used. CREATE ANNOTATION TABLE allows 
users to design and categorize their annotations at the stor- 
age level. This categorization will also facilitate annotation 
propagation (discussed in Section |3.4[) . where users may re- 
quest propagating a certain type of annotations. To drop an 
annotation table, the DROP ANNOTATION TABLE com- 
mand (Figure [4} is used. 

To efficiently store the annotations, we are investigating 
several storage and indexing schemes. One possible direc- 
tion is to consider compact representation of annotations 
that would improve the system performance with respect to 
storage overhead, I/O cost to retrieve the annotations, and 
the query processing time. For example, instead of stor- 
ing the annotations at the cell level, we may store some of 
the annotations at coarser granularities. For instance, the 
annotations over Table DB2_Gene (Figure [2]) can be repre- 
sented as rectangles attached to groups of contiguous cells 
as illustrated in Figure [5] where DB2_Gene is viewed as two- 
dimensional space, e.g., columns represent the X-axis and 
tuples represent the Y-axis. In this case, an annotation over 
any group of contiguous cells can be represented by a sin- 
gle annotation record. So, in general, an annotation over a 
subset of a table will map to multiple rectangular regions. 
Other annotation characteristics that may need to be taken 
into account include whether the annotation is linked to 
multiple data items in different tables or is linked to very 
few specific cells. 

3.2 Adding Annotations at Multiple Granu- 
larities 

To add annotations using A-SQL, we propose the ADD 
ANNOTATION command (Figure The annota- 

tion_tablejnames specifies to which annotation table(s) the 
added annotation will be stored. The annotation_body spec- 
ifies the annotation value to be added. The output of the 
SQL statement specifies the data to which the annotation is 
attached. Since annotations may contain important infor- 
mation that users want to query, we plan to support XML- 
formatted annotations. That is, annotation_body is an XML- 
formatted text. In this case, users can (semi-)structure their 
annotations and make use of XML querying capabilities over 
the annotations. The output of the SQLstatement can be at 
various granularities, e.g., entire tuples, columns, or group 
of cells. For example, to add annotation B3 over the en- 
tire GSequence column in Table DB2_Gene (as illustrated in 
Figure HJ, we execute the following ADD ANNOTATION 
command: 

ADD ANNOTATION 

TO DB2.Gene.G Annotation 

VAL UE '< Annotation > 

obtained from GenoBase 
< /Annotation > ' 

ON (Select G. GSequence 
From DB2.Gene G); 



3.1 Storing and Indexing Annotations 



In this case, the annotation is attached to the entire GSe- 



ADD ANNOTATION 

TO <annotationjable_names> 
VALUE <annotation_body> 
ON <SQL_statement> 

(a) 



SELECT [DISTINCT] C, [PROMOTE (C,, C k , ...)], ... 
FROM Relation_name [ANNOTATION^,, S 2 , ...)], ... 
[WHERE <data_conditions>] 
[AWH ERE <annotation_condition>] 
[GROUP BY <data_columns> 

[HAVING <data_condition>] 

[AHAVING <annotation_condition>] ] 
[FILTER <filter_annotation_condition>] 



ARCHIVE ANNOTATION RESTORE ANNOTATION 

FROM <annotationjable_names> FROM <annotationjable_names> 

[BETWEEN <time1> AND <time2>] [BETWEEN <time1> AND <time2>] 

ON <SELECT_statement> ON <SELECT_statement> 

(b) (C) 

Figure 6: The A-SQL commands ADD, ARCHIVE, 
and RESTORE 

quence column because no WHERE clause is specified. The 
annotation is stored in the annotation table GAnnotation. 
Notice that < Annotation > is the XML tag that encloses 
the annotation information. 

Similarly, to annotate an entire tuple, e.g., annotation B5, 
we execute the following ADD ANNOTATION command: 
ADD ANNOTATION 
TO DB2.Gene.G Annotation 
VAL UE '< Annotation > 

This gene has an unknown function 
< /Annotation > ' 
ON (Select G. * 

From DB2.Gene G 
WHERE GID = 'JW0080'); 

In this case, the annotation is attached to the entire tuples 
returned by the query since all the attributes in the table 
are selected. 

To allow users to link annotations to database operations, 
i.e., INSERT, UPDATE, or DELETE, the SQL.statement 
will be an INSERT, UPDATE or DELETE statement. For 
example, instead of inserting a new tuple and then anno- 
tating it by issuing a separate ADD ANNOTATION com- 
mand, users can insert and annotate the new tuple instantly 
by enclosing the insert statement inside the ADD ANNO- 
TATION command. For the delete operation, the deleted 
tuples will be stored in separate log tables along with the an- 
notation that specifies why these tuples have been deleted. 
Notice that the standard system recovery log cannot be used 
for this purpose as the users need the freedom to structure 
their annotation schemas the way they want, which system 
recovery logs do not support. 

We plan to add a visualization tool to allow users to anno- 
tate their data in a transparent way. The visualization tool 
displays users' tables as grids or spreadsheets where users 
can select one or more cells to annotate. Oracle address 
the integration of database tables with Excel spreadsheets 
to make use of Excel visualization and analysis power [2] . In 
bdbms, we plan to add this integration feature to facilitate 
adding and visualizing annotations. 

3.3 Archiving and Restoring Annotations 

Archival of annotations allows users to isolate old or in- 
valid annotations from recent and valuable ones. In bdbms, 
we support archival of annotations instead of permanently 
deleting them because biological data usually has a degree 
of uncertainty and old values may turn out to be the correct 
values. Archiving annotations gives users the flexibility to 



Figure 7: The A-SQL SELECT command 

restore the annotations back if needed. Unlike other anno- 
tations, archived annotations are not propagated to users 
along with the query answers. However, if archived annota- 
tions are restored, then they will be propagated normally. 

To archive and restore annotations, we introduce the 
ARCHIVE ANNOTATION (Figure [!») and RESTORE 
ANNOTATION (Figure (S^c)) commands, respectively. The 
FROM clause specifies from which annotation table(s) the 
annotations will be archived/restored. The optional clause 
BETWEEN specifies a time range over which the anno- 
tations will be archived/restored. This time corresponds 
to the times-tamp assigned to each annotation when it is 
first added to the database. The output from the SE- 
LECT statement specifies the data on which the annotations 
will be archived/restored. In addition, the output from the 
SELECT statement can be at multiple granularities, as ex- 
plained in the ADD ANNOTATION command. 

3.4 Annotation Propagation and Annotation- 
based Querying 

To support the propagation of annotations and querying 
of the data based on their annotations, we introduce the 
A-SQL command SELECT, given in Figure [3 A-SQL SE- 
LECT extends the standard SELECT by introducing new 
operators and extending the semantics of the standard op- 
erators. We introduce the new operators ANNOTATION, 
PROMOTE, AWHERE, AHAVING, and FILTER. 

The ANNOTATION operator allows users to specify 
which annotation table(s) to consider in the query. Using 
the ANNOTATION operator, users can propagate their an- 
notations transparently. That is, users do not have to know 
how or where annotations are stored. Instead, users only 
specify which annotations are of interest. 

The PROMOTE operator allows users to copy annota- 
tions from one or more columns, possibly not in the projec- 
tion list, to a projected column. For example, if column GID 
is projected from Table DBl_Gene, then Annotation A3 will 
not be propagated unless the annotations over GSequence 
are copied to GID. 

The AWHERE and AHAVING clauses are analogous to 
the standard WHERE and HAVING clauses except that the 
conditions of AWHERE and AHAVING are applied over the 
annotations. That is, AWHERE and AHAVING pass a tu- 
ple along with all its annotations only if the tuple's annota- 
tions satisfy the given AWHERE and AHAVING conditions. 
On the other hand, the FILTER clause passes all the data 
tuples of the input relation (keeps user's data intact) but it 
filters the annotations attached to each tuple. That is, any 
annotation that does not satisfy filter_annotation_condition 




Lab experiment 



Local insertion- 



What is the source of 
this value at time T? 



Where do these 



overwrite values come from? 



Figure 8: Data provenance at multiple granularities 



is dropped. 

The standard operators, e.g., projection, selection, and 
duplicate elimination, are also extended to process the an- 
notations attached to the tuples. For example, the projec- 
tion operator selects some user attributes from the input 
relation and passes only the annotations attached to those 
attributes. For example, projecting column GID from Ta- 
ble DB2_Gene (Figure [2]) results in reporting GID data along 
with annotations Bl, B4, and B5 only. The selection oper- 
ators in WHERE and HAVING select tuples from the input 
relation based on conditions applied over the data values. 
The selected tuples are passed along with all their annota- 
tions. For example, selecting the gene with GID = JW0080 
from Table DB2_Gene results in reporting the first tuple in 
DB2_Gene along with annotations Bl, B3, and B5. Opera- 
tors that group or combine multiple tuples into one tuple, 
e.g., duplicate elimination, group by, union, intersect, and 
difference, are also extended to handle the annotations at- 
tached to the tuples. These operators union the annotations 
over the grouped or combined tuples and attach them to the 
output tuple that represents the group. 

While defining the above commands and operators is only 
the first step in supporting annotations and other features 
within bdbms, we need to define for each A-SQL operator 
its algebraic definition, cost estimate function, and algebraic 
properties that can be used by the query optimizer to gen- 
erate efficient query plans. 

4. PROVENANCE MANAGEMENT 

Biologists commonly interact and exchange data with each 
other. Tracking the provenance (lineage) of data is very 
important in assessing the value and credibility of the data. 
Similar to annotations, data provenance can be attached 
to the database at multiple granularities, i.e., at the table, 
column, tuple levels, or any sub-groupings and subsets of the 
data. Also, biological data can be queried by its provenance. 
For example (refer to Figure |SJ), one table may contain data 
from multiple sources, e.g., Si and S2, or data that is locally 
inserted. Then, some values may be updated by a certain 
program, e.g., PI, and some columns may be overwritten 
by data from another source, e.g., S3. Then, users may be 
interested to know the source of some values at a certain 
moment in time. 

In bdbms, we treat provenance data as a kind of anno- 
tations where all the requirements and functionalities dis- 
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Figure 9: Local dependency tracking 

cussed in Section [3] are also applicable to provenance data. 
However, provenance data has special requirements and 
characteristics that need to be addressed including: 

• Structure of provenance data: Unlike annotations 
that can be free text, provenance data usually has 
well-defined structure. For example, the names of the 
source database and the source table draw their values 
from a list of pre-defined values. Supporting XML- 
formatted annotations can be beneficial in structur- 
ing provenance data. For example, provenance data 
can follow a predefined XML schema that needs to be 
stored and enforced by the database system. 

• Authorization over provenance data: End-users 
are usually not allowed to insert or update the prove- 
nance data. Provenance data needs to be automat- 
ically inserted and maintained by the system. For 
example, integration tools that copy the data from 
one database to another can be the only tools that 
insert the provenance information. End-users can 
only retrieve or propagate this information. There- 
fore, we need to provide an access control mechanism 
over the provenance data (and annotations in general) 
to restrict the annotation operations, e.g., addition, 
archival, and propagation, to certain users or programs 
as required. 

5. LOCAL DEPENDENCY TRACKING 

Biological databases are full of dependencies and deriva- 
tions among data items. In many cases, these dependencies 
and derivations cannot be automatically computed using 
coded functions, e.g., stored procedures or functions inside 
the database. Instead, they may involve prediction tools, 
lab experiments, or instruments to derive the data. Us- 
ing integrity constraints and triggers to maintain the con- 
sistency of the data is limited to computable dependencies, 
i.e., dependencies that can be computed via coded functions. 
However, non-computable dependencies cannot be directly 



handled using integrity constraints and triggers. In Figure[9] 
we give an example of the dependencies that can be found 
in biological databases. In Figure HJa), protein sequences 
are derived from the gene sequences using a prediction tool 
P, whereas the function of the protein is derived from the 
protein sequence using lab experiments. If a gene sequence 
is modified, then all protein sequences that depend on that 
gene have to be marked as outdated until their values are re- 
verified. Moreover, the function of the outdated proteins has 
to be marked as outdated until their values are re-evaluated. 

In Figure |9jb), we present another type of dependency 
where the value of the data in the database depends on the 
procedure or program that generated that data. For exam- 
ple, the values in the Evalue column (Figure[9jb)) depend on 
Procedure BLAST-2.2. 15. If a newer version of BLAST is 
used or BLAST is replaced with another procedure, then 
we need to re-evaluate the values in the Evalue column. 
These values can be automatically evaluated if BLAST can 
be modeled as a database function. Otherwise, the values 
have to be marked as Outdated. 

In bdbms, we propose to extend the concept of Func- 
tional Dependencies [5] [T3] to Procedural Dependencies. In 
Procedural Dependencies, we not only track the dependency 
among the data, but also the type and characteristics of the 
dependency, e.g., the procedure on which the dependency is 
based, whether or not that procedure can be executed by the 
database, and whether or not that procedure is invertible. 
For example, we can model the dependencies in Figure [9] 
using the following rules. 

Gene.GSequence Prediction 1001 p > Protein.PSequence (1) 
(Executable, 
non-invertible) 

Protein.PSequence L ab experiment > p rotein .p Function ( 2 ) 
(non-executable, 
non-invertible) 

GeneMatching.Gene1,GeneMatching.Gene2 — BLAST-2.2.15 f GeneMatching.Evalue (3) 

(Executable, 
non-invertible) 

Rule 1 specifies that Column PSequence in Table Protein 
depends on Column GSequence in Table Gene through the 
prediction tool P that is executable by the database and 
is non-invertible. Rule 2 specifies that column PFunction 
in Table Protein depends on Column PSequence through a 
lab experiment that is not executable by the database and 
is non-invertible. Rule 3 specifies that Column Evalue in 
Table GeneMatching depends on both columns Genel and 
Gene2 through Program BLAST-2.2.15 that is executable 
by the database and is non-invertible. For example, from 
Rule 2, we infer that when Column PSequence changes, the 
database can only mark PFucntion as Outdated. In con- 
trast, based on Rule 3, when either of the Genel or Gene2 
columns or Procedure BLAST-2.2.15 change, the database 
can automatically re-evaluate Evalue. 

In addition, the notion of Procedural Dependencies allows 
us to reason about the dependency rules. For example, in 
addition to the closure of an attribute, we can compute the 
closure of a procedure, i.e., all data in the database that 
depend on a specific procedure. We can also derive new 
rules, for example, based on rules (1) and (2) above, we can 



derive the following rule: 

Prediction too! P, 

Gene.GSequence — lab experiment ( Protein. PFunction (4) 
(non-executable, 
non-invertible) 

Rule 4 specifies that Column PFunction in Table Protein 
depends on Column GSequence in Table Gene through a 
chain of two procedures, a perdition tool P and a lab ex- 
periment. This chain is non-executable by the database and 
is non-invertible. Notice that the chain is non-executable 
because at least one of the procedures, namely the lab ex- 
periment, is non-executable. 

In bdbms, we address the following functionalities to track 
local dependencies: 

• Modeling dependencies: We use Procedural De- 
pendencies to allow users to model the dependencies 
among the database items as well as for bdbms to rea- 
son about these dependencies, for example, to detect 
conflicts and cycles among dependency rules, and to 
compute the closure of procedures. 

• Storing dependencies: Dependencies among the 
data can be either at the schema level, i.e., the en- 
tity level, or at the instance level, i.e., the cell level. 
Schema-level dependencies can be modeled using for- 
eign key constraints, e.g., protein sequences depend 
on gene sequences and they are linked by a foreign 
key. Instance-level dependencies are more complex to 
model because they are on a cell-by-cell basis. In this 
case, we can use dependency graphs to model such de- 
pendencies. 

• Tracking outdated data: When the database is 
modified, bdbms uses the dependency graphs to fig- 
ure out which items, termed the outdated items, may 
be affected by this modification. Outdated items need 
to be marked such that these items can be identified 
in any future reference. We propose to associate a 
bitmap with each table in the database. A cell in 
the bitmap is set to 1 if the corresponding cell in the 
data table is outdated, otherwise the bitmap cell is set 
to 0. For example, assume that the sequences corre- 
sponding to genes JW0080 and JW0082 (Figure M. a)) 
are modified, then the bitmap associated with Table 
Protein will be as illustrated in Figure 1101 Notice 
that the bits corresponding to PSequence are not set 
to 1 because PSequence is automatically updated by 
executing Procedure P. In contrast, PFunction cannot 
be automatically updated, therefore its corresponding 
bits are set to 1 to indicate that these values are out- 
dated. To reduce the storage overhead of the main- 
tained bitmaps, data compression techniques such as 
Run-Length-Encoding [23] can be used to effectively 
compress the bitmaps. 

• Reporting and annotating outdated data: The 

main objective of tracking local dependencies is that 
the database should be able to report at all times the 
items that need to be verified or re-evaluated. More- 
over, when a query executes over the database and 
involves outdated items, the database should propa- 
gate with those items an annotation specifying that 
the query answer may not be correct. Detecting the 
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outdated items at query execution time is a challenging 
problem as it requires retrieving and propagating the 
status of each item, i.e., whether it is outdated or not, 
in the query pipeline. A proposed solution is to con- 
sider the status of the database items as annotations 
attached to those items. These annotations will be au- 
tomatically propagated along with the query answers 
as discussed in Section [3] 

• Validating outdated data: bdbms will provide a 
mechanism for users to validate outdated items. An 
outdated item may or may not need to be modified 
to become valid. For example, a modification to a 
gene sequence may not affect the corresponding pro- 
tein sequence. In this case, the protein sequence will 
be revalidated without modifying its value. 

6. UPDATE AUTHORIZATION 

Changes over the database may have important conse- 
quences, and hence, they should be subject to authorization 
and approval by authorized entities before these changes 
become permanent in the database. Update authoriza- 
tion (also termed approval enforcement) in current database 
management systems is based on GRANT/REVOKE access 
models 1181 [241 , where a user may be granted an authoriza- 
tion to update a certain table or attribute. Although widely 
acceptable, these authorization models are based only on 
the identity of the user not on the content of the data be- 
ing inserted or updated. In biological databases, it is of- 
ten the case that a data item can make it permanently to a 
database based on its value not on the user who entered that 
value. For example, a lab administrator may allow his/her 
lab members to perform insert and update operations over 
the database. However, for reliability, these operations have 
to be revised by the lab administrator. If the lab admin- 
istrator is the only user who has the right to update the 
database, then this person may become a bottleneck in the 
process of populating the database. 

In bdbms, we introduce an approval mechanism, termed 
content-based approval, that allows the database to system- 
atically track the changes over the database. The pro- 
posed content-based approval mechanism works with, not 
in replacement to, existing GRANT/REVOKE mechanisms. 



The content-based approval mechanism maintains a log of all 
update operations, i.e., INSERT, UPDATE, and DELETE, 
that occur in the database. The database administrator can 
turn the content-based approval feature ON or OFF for a 
certain table or columns using a Start Content Approval and 
End Content Approval commands (Figure [TTJ, respectively. 
The tablejname value specifies the user table on which the 
update operations will be monitored. The optional clause 
COLUMNS specifies which column(s) in tablejname to mon- 
itor. For example, we can monitor the update operations 
over only Column GSequence of Table Gene (Figure (9{ a)). 
The APPROVED BY clause specifies the user or group of 
users who can approve or disapprove the update operations. 
If the content-based approval feature is turned ON over Ta- 
ble T, then bdbms stores all update operations over T in 
the log along with an automatically generated inverse state- 
ment that negates the effect of the original statement. More 
specifically, for INSERT, a DELETE statement will be gen- 
erated, for DELETE, an INSERT statement will be gener- 
ated, and for UPDATE, another UPDATE statement that 
restores the old values will be generated. The log stores 
also the user identifier who issued the update operation and 
the issuing time. The person in charge of the database, 
e.g., the lab administrator, can then view the maintained 
log and revise the updates that occurred in the database. 
If an operation is disapproved, then bdbms executes the in- 
verse statement of that operation to remove its effect from 
the database. Executing the inverse statement may affect 
other elements in the database, e.g., elements that depend 
on the currently existing values. It is the functionality of 
the Local Dependency Tracking feature (Section [5]) to track 
and invalidate these elements. 

7. INDEXING AND QUERY PROCESSING 

Biological databases warrant the use of non-traditional in- 
dexing mechanisms beyond B+- trees and hash tables. To 
enable biological algorithms to operate efficiently on the 
database, we propose integrating non-traditional indexing 
techniques inside bdbms. We focus on two fronts: (1) Sup- 
porting multidimensional datasets via multidimensional in- 
dexing techniques (suitable for protein 3D structures and 
surface shape matching), and (2) Supporting compressed 
datasets via novel external-memory indexes that work over 
the compressed data without decompressing it (suitable for 
indexing large sequences). 

In bdbms, we focus on introducing non-traditional index 
structures for supporting biological data. For example, com- 
pressing the data inside the database is proven to improve 
the system performance, e.g., C-store [33]. It reduces signif- 
icantly the size of the data, the number of I/O operations 
required to retrieve the data, and the buffer requirements. 
In bdbms, we investigate how we can store biological data 
in compressed form and yet be able to operate, e.g., index, 
search, and retrieve, on the compressed data without de- 
compressing it. 

7.1 Indexing Multi-dimensional Data 

Space-partitioning trees are a family of access methods 
that index objects in a multi-dimensional space, e.g., pro- 
tein 3D structures. In [3] LU LTJ3 EH] , we introduce an exten- 
sible indexing framework, termed SP-GiST, that broadens 
the class of supported indexes to include disk-based versions 
of space-partitioning trees, e.g., disk-based trie variants, 
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Figure 12: Indexing and querying RLE-compressed 
sequences 



quadtree variants, and kd-trees. As an extensible indexing 
framework, SP-GiST allows developers to instantiate a vari- 
ety of index structures in an efficient way through pluggable 
modules and without modifying the database engine. The 
SP-GiST framework is implemented inside PostgreSQL |34j 
and we use it in bdbms. Several index structures have been 
instantiated using SP-GiST, e.g., variants of the trie [lit 
[2D] , the kd-tree [BJ, the point quadtree [15], and the PMR 
quadtree [29]. We implemented several advanced search op- 
erations, e.g., k-nearest-neighbor search, regular expression 
match search, and substring searching. The experimental 
results in [16] demonstrate the performance potential of the 
class of space-partitioning tree indexes over the B+- tree and 
R-tree indexes, for the operations above. In addition to the 
performance gains and the advanced search functionalities 
provided by SP-GiST indexes, it is the ability to rapidly pro- 
totype these indexes inside bdbms that is most attractive. 

A key challenge is to integrate SP-GiST indexes inside bi- 
ological analysis algorithms such as protein structure align- 
ment algorithms. Providing the index structures is the first 
step to improve the querying and processing capabilities of 
the analysis algorithms. 

7.2 Indexing Compressed Data 

Biological databases consist of large amounts of sequence 
data, e.g., genes, alleles, and protein primary and secondary 
structures. These sequences need to be stored, indexed, and 
searched efficiently. In bdbms, we propose to investigate 
new techniques for compressing biological sequences and op- 
erating over the compressed data without decompressing 
it. Sequence compression has been addressed recently in 
the C-Store database management system [33], where some 
operators, e.g., aggregation operators, can operate directly 
over the compressed data. Sequence compression is demon- 
strated to improve system performance as it reduces the size 
of the data significantly. 

In bdbms, as a first step, we investigate the processing, 
e.g., indexing and querying, of Run-Length- Encoded (RLE) 



sequences. RLE [23] is a compression technique that re- 
places the consecutive repeats of a character C by one oc- 
currence of C followed by C's frequency. One of the main 
challenges is how to operate on the compressed data with- 
out decompressing it. In [17], we proposed an index struc- 
ture, termed the SBC-tree (String _B-tree for Compressed 
sequences), for indexing and searching RLE-compressed se- 
quences of arbitrary length. In Figure [12] we illustrate how 
protein secondary structure sequences are stored in bdbms. 
We first compress the sequences using RLE, and then build 
an SBC-tree index over the compressed sequences. Queries 
over the sequences will use the index to retrieve the desired 
data without decompression. The SBC-tree is a two-level 
index structure based on the well-known String B-tree and 
a 3-sided range query structure. The SBC-tree supports 
substring as well as prefix matching, and range search oper- 
ations over RLE-compressed sequences. The SBC-tree has 
an optimal external-memory space complexity as well as op- 
timal search time for substring matching, prefix matching, 
and range search queries. More interestingly, SBC-tree has 
shown to be very practical to implement. The SBC-tree in- 
dex is prototyped in PostgreSQL with an R-tree in place of 
the 3-sided structure. Preliminary performance results il- 
lustrate that using the SBC-tree to index RLE-compressed 
protein sequences achieves up to an order of magnitude re- 
duction in storage, up to 30% reduction in I/Os for the 
insertion operations, and retains the optimal search perfor- 
mance achieved by the String B-tree over the uncompressed 
sequences. 

In bdbms, we plan to address the following challenges re- 
garding the processing of compressed data: 

• Full integration of the SBC-tree index: To fully 
integrate the SBC-tree index inside bdbms we plan to 
address several query processing and optimization is- 
sues including: (1) supporting subsequence matching, 
and (2) providing accurate cost functions for estimat- 
ing the cost of the index. Subsequence matching is an 
important operation over biological sequences as it is 
used in many algorithms such as sequence alignment 
algorithms. We plan to extend the supported oper- 
ations of the SBC-tree index to include subsequence 
matching. 

• Processing various formats of compressed data: 

Currently, bdbms supports indexing and querying 
RLE-compressed sequence data. RLE is effective in 
the case of sequences where characters have long re- 
peats in tandem. Compression techniques like gzip 
and Burrows- Wheeler Transform (BWT) can be more 
effective in compressing the other kinds of data. Our 
plan is to investigate indexing and querying other 
formats of compressed data in addition to RLE- 
compressed sequences to efficiently support these data 
inside bdbms. 

8. RELATED WORK 

Periscope [301136] is an ongoing project that aims at defin- 
ing a declarative query language for querying biological data. 
Periscope/SQ [36] , a component of Periscope, introduces 
new operators and data types that facilitate the process- 
ing and querying of sequence data. While the main focus of 
Periscope is on defining and supporting a new declarative 
query language, bdbms focuses on other functionalities that 



are required by biological databases, e.g., annotation and 
provenance management, local dependency tracking, update 
authorization, and non-traditional access methods. 

Several annotation systems have been built to manage 
annotations over the web, e.g., [D HS1 HZ1 El El [3U ■ Bio- 
das (Biological Distributed Annotation System) [TJ [32] and 
Human Genome Browser 27] are specialized biological an- 
notation systems to annotate genome sequences. They allow 
users to integrate genome annotation information from mul- 
tiple web servers. Managing annotations and provenance in 
relational databases has been addressed in [7] [8] 1101 1121 1211 
135] . In these techniques provenance data is pre-computed 
and stored inside the database as annotations. The main 
focus of these techniques is to propagate the annotations 
along with the query answer. Other aspects of annotation 
management, e.g., insertion, storage, and indexing, have not 
been addressed. Another approach for tracking provenance, 
termed the lazy approach, has been addressed in [9] 1141 1151 
138] . where provenance data is computed at query time. Lazy 
approach techniques require that the derivation steps of the 
data to be known and to be invertible such that the prove- 
nance information can be computed. In bdbms, we treat 
provenance data as a kind of annotations because the deriva- 
tion of biological data is usually ad-hoc and does not neces- 
sarily follow certain functions or queries. 

The access control and authorization process in cur- 
rent database systems is based on the GRANT/REVOKE 
model |18II24] , Although widely acceptable, this model lacks 
being content-based, i.e., the authorization is based only on 
the identity of the user. In bdbms, we propose the content- 
based approval model that is based on the data as well as on 
the identity of the user. 

9. CONCLUDING REMARKS 

Two applications have been driving the bdbms project: 
building a database resource for the Escherichia coli (E. coli) 
model organism and a protein structure database project. 
Through these two projects, we realized the need for the 
functionalities that we address in bdbms, namely (1) Anno- 
tation and provenance management, (2) Local dependency 
tracking, (3) Update authorization, and (4) Non-traditional 
and novel access methods. 

bdbms is currently being prototyped using PostgreSQL. 
In parallel work, we have extended relational algebra to op- 
erate on "annotated" relations. The A-SQL language and 
the content-based authorization model are currently under 
development in PostgreSQL. The SP-GIST and SBC-tree 
access methods are already integrated inside PostgreSQL. 
We are currently studying several optimizations, cost esti- 
mates, and complex operations over these indexes. 
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